Abstract

In this paper we describe the extension of the CAPO 
parallelization support tool to support multilevel 
parallelism based on OpenMP directives. CAPO
generates OpenMP directives with extensions 
supported by the NanosCompiler to allow for directive 
nesting and definition of thread groups. We report first 
results for several benchmark codes and one full
application that have been parallelized using our 
system. 

1 Introduction 

Parallel architectures are an instrumental tool for the
execution of computationally intensive applications.
Simple and powerful programming models and environments
are required to develop and tune such parallel
applications. Current programming models offer either 
library-based implementations (such as MPI [16]) or 
extensions to sequential languages (directives and lan- 
guage constructs) that express the available parallelism 
in the application, such as OpenMP [19]. 

OpenMP was introduced as an industrial standard
for shared-memory programming with directives. Re- 
cently, it has gained significant popularity and wide 
compiler support. However, relevant performance is- 
sues must still be addressed which concern 
programming model design as well as implementation. 
In addition to that, extensions to the standard are being 
proposed and evaluated in order to widen the applica- 
bility of OpenMP to a broad class of parallel 
applications without sacrificing portability and
simplicity.

What has not been clearly addressed in OpenMP is
the exploitation of multiple levels of parallelism. The lack
of compilers that are able to exploit further parallelism
inside a parallel region has been the main cause of this
problem, which has favored the practice of combining
several programming models to address scalability of
applications to exploit multiple levels of parallelism on
a large number of processors. The nesting of parallel
constructs in OpenMP is a feature that requires attention
in future releases of OpenMP compilers. Some
research platforms, such as the OpenMP NanosCom- 
piler [9], have been developed to show the feasibility 
of exploiting nested parallelism in OpenMP and to 
serve as testbeds for new extensions in this direction. 
The OpenMP NanosCompiler accepts Fortran-77 code 
containing OpenMP directives and generates plain For- 
tran-77 code with calls to the NthLib thread library 
[17] (currently implemented for the SGI Origin). In 
contrast to the SGI MP library, NthLib allows for mul- 
tilevel parallel execution such that inner parallel 
constructs are not serialized. The NanosCompiler
programming model supports several extensions to the 
OpenMP standard to allow the user to control the allo- 
cation of work to the participating threads. By 
supporting nested OpenMP directives, the NanosCompiler
offers a convenient way to express multilevel parallelism.

In this study, we have extended the automatic paral- 
lelization tool, CAPO, to allow for the generation of 
nested OpenMP parallel constructs in order to support 
multilevel shared memory parallelization. CAPO 
automates the insertion of OpenMP directives with 
nominal user interaction to facilitate parallel process- 
ing on shared memory parallel machines. It is based on 
CAPTools [11], a semi-automatic parallelization tool 
for the generation of message passing codes, developed 
at the University of Greenwich. 

To this point there is little reported experience with 
shared memory multilevel parallelism. By being able 
to generate nested directives automatically in a reason- 
able amount of time we hope to be able to gain a better 
understanding of performance issues and the needs of 
application programs when it comes to exploiting
multilevel parallelism.

The paper is organized as follows: Section 2 sum- 
marizes the NanosCompiler extensions to the OpenMP 
standard. Section 3 discusses the extension of CAPO to 
generate multilevel parallel codes. Section 4 presents 
case studies on several benchmark codes and one full 
application. 


*The author is an employee of Computer Sciences Corporation 


2 The NanosCompiler 

OpenMP provides a fork-and-join execution model 
in which a program begins execution as a single process
or thread. This thread executes sequentially until a
PARALLEL construct is found. At this time, the thread 
creates a team of threads and it becomes its master 
thread. All threads execute the statements lexically 
enclosed by the parallel construct. Work-sharing con- 
structs (DO, SECTIONS and SINGLE) are provided to 
divide the execution of the enclosed code region 
among the members of a team. All threads are independent
and may synchronize at the end of each work-sharing
construct or at specific points (specified by the
BARRIER directive). Exclusive execution mode is also 
possible through the definition of CRITICAL and 
ORDERED regions. If a thread in a team encounters a 
new PARALLEL construct, it creates a new team and it 
becomes its master thread. OpenMP v2.0 provides the 
NUM_THREADS clause to restrict the number of 
threads that compose the team. 

The NanosCompiler extension to multilevel parallelization
is based on the concept of thread groups. A
group of threads is composed of a subset of the total
number of threads available in the team to run a parallel
construct. In a parallel construct, the programmer
may define the number of groups and the composition
of each one. When a thread in the current team encounters
a PARALLEL construct defining groups, the thread
creates a new team and becomes its master thread.
The new team is composed of as many threads as the
number of groups. The rest of the threads are used to
support the execution of nested parallel constructs. In
other words, the definition of groups establishes an 
allocation strategy for the inner levels of parallelism. 
To define groups of threads, the NanosCompiler sup- 
ports the GROUPS clause extension to the PARALLEL 
directive. 

C$OMP PARALLEL GROUPS (gspec) 

C$OMP END PARALLEL 

Different formats for the GROUPS clause argument
gspec are allowed [10]. The simplest specifies the
number of groups and performs an equal partition of
the total number of threads to the groups:

gspec = ngroups 

The argument ngroups specifies the number of 
groups to be defined. This format assumes that work is 
well balanced among groups and therefore all of them 
receive the same number of threads to exploit inner 


levels of parallelism. At runtime, the composition of 
each group is determined by equally distributing the 
available threads among the groups. 
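The equal partition described above can be illustrated with a small sketch. This is Python rather than the NanosCompiler runtime, and it rests on assumptions that the text does not state: groups are formed from contiguous thread identifiers, leftover threads go to the first groups, and the first thread of each group acts as its master. The name partition_equal is hypothetical.

```python
def partition_equal(num_threads, ngroups):
    """Split thread ids 0..num_threads-1 into ngroups contiguous groups.

    Assumption: remainder threads are handed to the first groups, so
    group sizes differ by at most one.
    """
    if not 1 <= ngroups <= num_threads:
        raise ValueError("need 1 <= ngroups <= num_threads")
    base, extra = divmod(num_threads, ngroups)
    groups, start = [], 0
    for g in range(ngroups):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


groups = partition_equal(8, 3)
# Assumed convention: the first thread of each group executes the outer
# level; the remaining threads serve the nested parallel constructs.
masters = [g[0] for g in groups]
print(groups, masters)
```

With 64 threads and 16 groups, for instance, every group in this sketch receives 4 threads.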

gspec = ngroups, weight 

In this case, the user specifies the number of groups 
(ngroups) and an integer vector (weight) indicat- 
ing the relative weight of the computation that each 
group has to perform. From this information and the 
number of threads available in the team, the threads are 
allocated to the groups at runtime. The vector weight 
is allocated by the user and its values are computed 
from information available within the application itself 
(for instance iteration space, computational complex- 
ity). 
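A weighted allocation of this kind might look like the following sketch. The actual rule used by the NanosCompiler runtime is not given here, so everything in the code is an assumption: each group is guaranteed one thread, the remaining threads are shared out in proportion to the weights, and rounding leftovers go to the groups with the largest remainders. The name partition_weighted is hypothetical.

```python
def partition_weighted(num_threads, weights):
    """Return the number of threads given to each group.

    Assumed rule (not taken from the NanosCompiler): one thread per
    group first, then a proportional share of the spare threads, with
    leftovers assigned by largest remainder.
    """
    ngroups = len(weights)
    if ngroups == 0 or ngroups > num_threads:
        raise ValueError("need 1 <= number of groups <= num_threads")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive integers")
    total = sum(weights)
    spare = num_threads - ngroups
    # Integer quotient and remainder of each group's proportional share.
    quot_rem = [divmod(spare * w, total) for w in weights]
    sizes = [1 + q for q, _ in quot_rem]
    leftover = num_threads - sum(sizes)
    order = sorted(range(ngroups), key=lambda g: quot_rem[g][1], reverse=True)
    for g in order[:leftover]:
        sizes[g] += 1
    return sizes


print(partition_weighted(12, [1, 1, 2, 4]))
```

The weight vector here plays the role of the user-supplied integer vector weight; a group whose loop covers four times the iteration space of another receives correspondingly more threads.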

3 The CAPO Parallelization Support Tool 

The main goal of developing parallelization support
tools is to eliminate as much as possible of the tedious
and sometimes error-prone work that is needed for manual
parallelization of serial applications. With this in mind,
CAPO [13] was developed to automate the insertion of
OpenMP compiler directives with nominal user interaction.
This is achieved largely by use of the accurate
interprocedural analysis from CAPTools [11]; CAPO also
provides a directive browser that allows the
user to examine and refine the directives automatically
placed within the code. CAPTools provides a fully
interprocedural and value-based dependence analysis 
engine [14] and has successfully been used to parallel- 
ize a number of mesh-based applications for distributed 
memory machines. 

3.1 Single level parallelization 

The single loop level parallelism automatically ex- 
ploited in CAPO can be defined by the following three 
stages (see [13] for more details of these stages and 
their implementation): 

1) Identification of parallel loops and parallel regions
- this includes a comprehensive breakdown of
the different loop types, such as serial, parallel includ- 
ing reductions, and pipelines. The outermost parallel 
loops are considered for parallelization so long as they 
provide sufficient granularity. Since the dependence 
analysis is interprocedural, the parallel regions can be 
defined as high up in the call tree as possible. This 
provides an efficient placement of the directives. 

2) Optimization of parallel regions and parallel 
loops - the fork-and-join overhead (associated with 
starting a parallel region) and the synchronizing cost 
are greatly lowered by reducing the number of parallel 
regions required. This is achieved by merging together
parallel regions where there is no violation of data us- 
age. In addition, the synchronization between 
successive parallel loops is removed if it can be proved 
that the loops can correctly execute asynchronously 
(using the NOWAIT clause). 

3) Code transformation and insertion of OpenMP 
directives - this includes the search for and insertion of 
possible THREADPRIVATE common blocks. There is 
also special treatment for private variables in
non-threadprivate common blocks. If there is a usage conflict
then the routine is cloned and the common block
variable is added to the argument list of the cloned
routine. Finally, the call graph is traversed to place 
OpenMP directives within the code. This includes the 
identification of necessary variable types, such as 
SHARED, PRIVATE, and REDUCTION. 

3.2 Extension to multilevel parallelization 

Our extension to OpenMP multilevel parallelism is 
based on parallelism at different loop nests. Multilevel 
parallelism can also be exploited with task parallelism 
but this is not considered, partly because task parallelism
is not well defined in the current OpenMP
specification. Currently, we limit our approach to
two-level loop parallelism, which is of more practical
use. The approach to automatically exploit two-level
parallelism is extended from the single level parallelization
and is illustrated in Figure 1. Besides the data
dependence analysis in the beginning, the approach can
be summarized in the following four steps.

1) First-level loop analysis. This is essentially the
combination of the first two stages in the single level 
parallelization where parallel loops and parallel regions 
are identified and optimized at the outermost loop 
level. 



Figure 1: Steps in multilevel parallelization 


2) Second-level loop analysis. This step involves
the identification of parallel loops and parallel regions 
nested inside the parallel loops that were identified in 
Step 1. These parallel loops and parallel regions are 
then optimized as before but limited to the scope de- 
fined by the first level. 

3) Second-level directive insertion. This includes 
code transformation and OpenMP directives insertion 
for the second level. This step is performed before
inserting any directives at the first level to ensure that a
consistent picture is maintained for any variables and
code that may be changed or introduced during the
code transformation.

4) First-level directive insertion. Lastly, code transformation
and OpenMP directive insertion are
performed for the outer level parallelization. All the
transformations of the last stage of the single level
parallelization are performed, with the exception
that we disallow the THREADPRIVATE directive.
Compared to single level parallelization, the two-level
parallelization process requires the additional steps
indicated in the dashed box in Figure 1.


3.3 Implementation consideration 

In order to maintain consistency during the code 
transformations that occur during the parallelization 
process we need to update data dependencies properly. 
Consider, for example, the case where CAPO transforms an
array reduction into updates to a local variable. This is
followed by an update to the global array in a
CRITICAL section to work around the limitation on
reduction in OpenMP v1.x. The data dependence graph
needs to be updated to reflect the change due to this
transformation, such as associating dependence edges
related to the original variable to the local variable and
adding new dependences for the local variable from the
local updates to the global update. Performing a full
data dependence analysis for the modified code block 
is another possibility but this would not take advantage 
of the information already obtained from the earlier 
dependence analysis. 
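The array reduction transformation mentioned above can be sketched as follows. This is an illustrative Python analogue and not the Fortran code that CAPO emits: each worker accumulates into a private copy of the array and then adds that copy into the shared array while holding a lock, which stands in for the CRITICAL section. The function name array_reduction, the cyclic distribution of iterations, and the use of Python threads are assumptions made for the example.

```python
import threading


def array_reduction(values, bins, n_bins, n_threads=4):
    """Sum values[i] into total[bins[i]] using per-thread local arrays."""
    total = [0] * n_bins            # shared "global" array
    lock = threading.Lock()         # stands in for the CRITICAL section

    def worker(tid):
        local = [0] * n_bins        # private copy, updated without locking
        for i in range(tid, len(values), n_threads):
            local[bins[i]] += values[i]
        with lock:                  # one guarded update of the global array
            for b in range(n_bins):
                total[b] += local[b]

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return total


result = array_reduction([1, 2, 3, 4, 5, 6], [0, 1, 0, 1, 0, 1], n_bins=2)
print(result)
```

The sketch shows why the dependence graph changes: the loop body now depends only on the local array, and a new dependence links the local updates to the single guarded update of the global array.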

When nested parallel regions are considered, the 
scope of the THREADPRIVATE directive is not clear 
any more, since a variable may be threadprivate for the
outer nest of parallel regions but shared for the inner 
parallel regions, and the directive cannot be bound to a 
specific nest level. The OpenMP specification does not 
properly address this issue. Our solution is to disallow 
the THREADPRIVATE directive when nested parallelism
is considered and treat any private variables
defined in common blocks by a special transformation 
as mentioned in Section 3.1. 

The scope of the synchronization directives should 
be carefully followed. For example, the MASTER directive
is not allowed in the extent of a PARALLEL DO.
This changes the way a software pipeline (see [13] for 
further explanation) can be implemented if it is nested 
inside an outer parallel loop. 

When implementing a pipeline, the outer loop needs
to be considered as well. This is illustrated by the fol- 
lowing example. Assume we have a nest containing 
two loops: 

      DO K=1,NK
        DO J=2,NJ
          A(J,K) = A(J,K) + A(J-1,K)

The outer loop K is parallel and the inner loop J can be 
set up with a pipeline. After inserting directives at the 
second level to set up the pipeline, we have 

!$OMP PARALLEL
      DO K=1,NK
!     ..point-to-point sync directive
!$OMP DO
        DO J=2,NJ
          A(J,K) = A(J,K) + A(J-1,K)

The implementation of the point-to-point synchronization
with directives is illustrated in Section 4.2. In
order to parallelize the K loop at the outer level, we
need to first transform the loop into a form such that
the outer-level directives can be added. This is achieved
by explicitly calculating the K-loop bound for each
outer-level thread as shown in the following code:

!$OMP PARALLEL DO GROUPS(ngroups)
      DO IT=1,omp_get_num_threads()
        CALL calc_bound(IT, 1, NK,
     >                  low, high)
!$OMP PARALLEL
        DO K=low,high
!     ..point-to-point sync directive
!$OMP DO
          DO J=2,NJ
            A(J,K) = A(J,K) + A(J-1,K)

The function "calc_bound" calculates the K loop
bound (low, high) for a given IT (the thread num- 
ber) from the original K loop limit. Only then are the 
first-level directives added to the IT loop (instead of 
the K loop). The method is not as elegant as one would 
prefer, but it points to some of the limitations with the 
nested OpenMP directives. In particular we would not 
be able to set up a two-dimensional pipeline, since it 
would involve synchronization of threads from two 
different nest levels. We will discuss the problem of 
two-dimensional pipelining in one of our case studies 
in Section 4.2. 
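The bound calculation performed by calc_bound can be sketched in a few lines. The paper does not show the body of the routine, so the block distribution below is an assumption: iterations are divided into contiguous blocks whose sizes differ by at most one, with the larger blocks assigned to the lowest thread numbers. The sketch is in Python and passes the number of threads explicitly, whereas the Fortran routine presumably obtains it from the runtime.

```python
def calc_bound(it, lo, hi, nthreads):
    """Return the inclusive (low, high) sub-range of lo..hi for thread it.

    it is 1-based, as in the IT loop. An empty block is returned as
    low > high, which a Fortran DO loop would execute zero times.
    """
    n = hi - lo + 1
    base, extra = divmod(n, nthreads)
    low = lo + (it - 1) * base + min(it - 1, extra)
    high = low + base - 1 + (1 if it <= extra else 0)
    return low, high


# Example: NK = 10 iterations of the K loop shared by 4 outer-level threads.
bounds = [calc_bound(it, 1, 10, 4) for it in range(1, 5)]
print(bounds)
```

The blocks cover 1..NK exactly once, so replacing the K loop by the IT loop with an inner K loop over (low, high) preserves the original iteration space.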

One of the contributions by the NanosCompiler to 
support nested directives is the GROUPS clause, which 
can be used to define the number of thread groups to be 
created at the beginning of an outer-nest parallel re- 
gion. In our implementation, the GROUPS directive 
containing a single shared variable ‘ngroups’ is gen- 
erated for all the first-level parallel regions. The 
ngroups variable is placed in a common block and 
can be defined by the user at run time. Although it 
would be better to generate the GROUPS clause with a 
weight argument based on different workloads of 
parallel regions, this is not considered at the moment. 

4 Case Studies 

In this section we show examples for successful and 
not so successful automatic multilevel parallelization.
We have parallelized the three application benchmarks 
(BT, SP, and LU) from the NAS Parallel Benchmarks 
[4] and the ARC3D [21] application code using the 
CAPO multilevel parallelization feature and examined 
its effectiveness. 

In each of our experiments we generate nested 
OpenMP directives and use the NanosCompiler for 
compilation and building of the executables. As dis- 
cussed in Sections 2 and 3, the nested parallel code 
contains the GROUPS clause at the outer level. Accord- 
ing to the OpenMP standard, the number of executing 
threads can be specified at runtime by the environment 
variable OMP_NUM_THREADS. We introduce the environment
variable NANOS_GROUPS and modify the
source code to have the main routine check the value 
of this variable and set the argument to the GROUPS 
clause accordingly. This allows us to run the same ex- 
ecutable not only with different numbers of threads, 
but also with different numbers of groups. We compare 
the timings for different numbers of groups to each
other. Note that single level parallelization of the outer 
loop corresponds to the case that the number of execut- 
ing threads is equal to the number of groups, i.e. there 
is only one thread in each group. We compare these 
timings to those resulting from compilation with the 
native SGI compiler, which supports only the single 
level OpenMP parallelization and serializes inner par- 
allel loops. 
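The run-time selection of the number of groups can be sketched as follows. This is a minimal Python illustration, not the Fortran code added to the benchmarks. The function name read_ngroups, the range check, and the fallback to one thread per group (the single level case) when NANOS_GROUPS is unset are assumptions.

```python
import os


def read_ngroups(num_threads, environ=os.environ):
    """Return the number of groups requested through NANOS_GROUPS.

    Assumption: if the variable is unset, use one thread per group,
    which corresponds to single level parallelization of the outer loop.
    """
    raw = environ.get("NANOS_GROUPS")
    if raw is None:
        return num_threads
    ngroups = int(raw)
    if not 1 <= ngroups <= num_threads:
        raise ValueError("NANOS_GROUPS must lie between 1 and the thread count")
    return ngroups


# Example with an explicit mapping instead of the real environment.
print(read_ngroups(64, {"NANOS_GROUPS": "16"}))
```

Reading the value once in the main routine and storing it in the shared ngroups variable is what allows one executable to be timed with different group counts.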

The timings were obtained on an SGI Origin 2000 
with R12000 CPUs, 400MHz clock and 768MB local
memory per node. 

4.1 Successful multilevel parallelization: the 
BT and SP benchmarks 

The NAS Parallel Benchmarks BT and SP are both 
simulated CFD applications with a similar structure. 
They use an implicit algorithm to solve the 3D compressible
Navier-Stokes equations. The x, y, and z
dimensions are decoupled by usage of an Alternating 
Direction Implicit (ADI) factorization method. In BT, 
the resulting systems are block-tridiagonal with 5x5 
blocks. The systems are solved sequentially along each 
dimension. SP uses a diagonalization method that de- 
couples each block-tridiagonal system into three 
independent scalar pentadiagonal systems that are 
solved sequentially along each dimension. 

A study about the effects of single level OpenMP
parallelization of the NAS Parallel Benchmarks can be 
found in [12]. In our experiments we started out with 
the same serial implementation of the codes that was 
the basis for the single level OpenMP implementation 
as described in [12]. We ran class A (64x64x64 grid
points), B (102x102x102 grid points), and C
(162x162x162 grid points) for the BT and SP benchmarks.
As an example we show timings for problem
class A for both benchmarks in Figure 2. 


[Plot: BT Class A (problem size 64x64x64); x-axis: number of threads]



Figure 2: Timing results for class A benchmarks. 

The programs compiled with the SGI OpenMP 
compiler scale reasonably well up to 64 threads, but do 
not show any further speed-up if more threads are
used. For a small number of threads (up to 64), the
outer level parallel code generated by the NanosCompiler
runs somewhat slower than the code generated by
the SGI compiler, but its relative performance im- 
proves with increasing number of threads. When 
increasing from 64 to 128 threads, the multilevel paral- 
lel code still shows a speed-up, provided the number of 
groups is chosen in an optimal way. We observed a 
speed-up of up to 85% for 128 threads. In Figure 3 we 
show the speed-up resulting from nested parallelization 
for three problem classes of the SP and BT bench- 
marks. We denote by 

• SGI OpenMP: the time for outer loop paralleli- 
zation using just the native SGI compiler, 

• Nanos Outer: the time for outer loop paralleliza- 
tion using the NanosCompiler, 

• Nanos Minimal: the minimal time for nested 
parallelization using the NanosCompiler. 

For the BT benchmark CAPO parallelized 28 loops, 
13 of which were suitable for nested parallelization. 
For the SP benchmark CAPO parallelized 31 loops, 17
of which were suitable for multilevel parallelism. In 
both benchmarks the most time consuming loops are 
parallelized in two dimensions. All of the nested paral- 
lel loops are at least triple nested. The structure of the 
loops is such that the two outermost loops can be
parallelized. The inner parallel loops enclose one or more
inner loops and contain a reasonably large amount of 
computational work. 

The reason that multilevel parallelism has a positive
effect on the performance of these loops is mainly
that load balancing between the threads is
improved. For class A, for example, the number of
iterations is 62. If only the outer loop is parallelized,
using more than 62 threads will not improve the
performance any further. In the case of 64 threads, 2 of
them will be idling. If, however, the second loop level
is also parallelized, all 64 threads can be put to use.
Our experiments show that choosing too small a number
of groups actually decreases performance. Setting the
number of groups to 1 effectively
moves the parallelism completely to the inner loop,
which will in most cases be less efficient than
parallelizing the outer loop.

In Table 1 we show the maximal and minimal num- 
ber of iterations (for class A) of the inner parallel loop 
that a thread has to execute, depending on the number 
of groups. 


# Groups    Max # Iters    Min # Iters
64
32                         31
16          64             45
8           64             49
4           64             45

Table 1: Thread workload for the class A prob- 
lems BT and SP. 
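The workload figures of Table 1 can be reproduced with a short calculation. The sketch below is Python and assumes, without confirmation from the text, that both loop levels use a block distribution in which chunk sizes differ by at most one, that each parallel loop has 62 iterations (class A), and that the 64 threads are divided evenly among the groups. Under these assumptions the results agree with the entries for 16, 8, and 4 groups and with the minimum for 32 groups; the remaining values it prints are consequences of the assumptions, not data taken from the table.

```python
def block_sizes(n_iters, n_parts):
    """Largest and smallest chunk when n_iters are split into n_parts blocks."""
    base, extra = divmod(n_iters, n_parts)
    return base + (1 if extra else 0), base


def thread_workload(n_iters, n_threads, n_groups):
    """Return (max, min) inner-loop iterations executed by a single thread."""
    threads_per_group = n_threads // n_groups
    outer_max, outer_min = block_sizes(n_iters, n_groups)
    inner_max, inner_min = block_sizes(n_iters, threads_per_group)
    return outer_max * inner_max, outer_min * inner_min


for n_groups in (64, 32, 16, 8, 4):
    print(n_groups, *thread_workload(62, 64, n_groups))
```

With 64 groups of one thread each, the smallest outer chunk in this model is empty, which corresponds to the two idle threads in the single level case with 64 threads.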

To give a flavor of how the performance of the mul- 
tilevel parallel code depends on the grouping of threads 
we show timings for the BT benchmark on 64 threads 
and varying number of groups in Figure 4. The timings
indicate that good criteria to choose the number of 
groups are: 

• Efficient granularity of the parallelism, i.e., the 
number of groups has to be sufficiently small. In 
our experiments we observe that the number of 
groups should not be smaller than the number of
threads within a group. 


• The number of groups has to be large enough to 
ensure a good balancing of work among the 
threads. 


[Plot: BT benchmark with 64 threads]



Figure 4: Timings of BT with varying number of 
thread groups. 


4.2 The need for OpenMP extensions: the LU 
benchmark 

The LU application benchmark is a simulated CFD
application that uses the symmetric successive
over-relaxation (SSOR) method to solve a seven-band
block-diagonal system resulting from finite-difference
discretization of the 3D compressible Navier-Stokes
equations by splitting it into block lower and block
upper triangular systems.

[Plots: BT and SP speed-up with nested parallelization for classes A, B, and C; x-axis: number of threads]

Figure 3: Speed-up due to nested parallelism.

As a starting point for our tests we chose the pipelined
implementation of the parallel SSOR algorithm,
as described in [12]. The example below shows the 
loop structure of the lower-triangular solver in SSOR. 
The lower-triangular and diagonal systems are formed 
in routine JACLD and solved in routine BLTS. The 
index K corresponds to the third coordinate direction. 

      DO K = KST, KEND
        CALL JACLD(K)
        CALL BLTS(K)
      END DO

      SUBROUTINE BLTS(K)
      DO J = JST, JEND
        Loop_Body(J,K)
      END DO
      RETURN
      END

To set up a pipeline for the outer loop, thread 0 
starts to work on its first chunk of data in K direction.
Once thread 0 finishes, thread 1 can start working on 
its chunk for the same K and, in the meantime, thread 0 
moves on to the next K. CAPO detects such opportuni- 
ties for pipelined parallelism. The directives generated 
by CAPO to implement the pipeline for the outer loop 
are shown in Figure 5. 


!$OMP PARALLEL PRIVATE(K, iam, numt)
      iam = omp_get_thread_num()
      numt = omp_get_num_threads()
      isync(iam) = 0
!$OMP BARRIER
      DO K = KST, KEND
        CALL JACLD(K)
        CALL BLTS(K)
      END DO
!$OMP END PARALLEL

      SUBROUTINE BLTS(K)
      if (iam .gt. 0 .and.
     >    iam .lt. numt) then
        do while (isync(iam-1) .eq. 0)
!$OMP FLUSH(isync)
        end do
        isync(iam-1) = 0
!$OMP FLUSH(isync)
      end if
!$OMP DO
      DO J = JST, JEND
        Loop_Body(J,K)
      END DO
!$OMP END DO nowait
      if (iam .lt. numt) then
        do while (isync(iam) .eq. 1)
!$OMP FLUSH(isync)
        end do
        isync(iam) = 1
!$OMP FLUSH(isync)
      endif
      RETURN
      END

Figure 5: The one-dimensional parallel pipeline 
implemented in LU. 

The K loop is placed inside a parallel region. Two 
OpenMP library functions are called to obtain the cur- 
rent thread identifier (iam) and the total number of 
threads (numt). The shared array isync is used to 
indicate the availability of data from neighboring 
threads. Together with the FLUSH directive in a 
WHILE loop it is used to set up the point-to-point
synchronization between threads. The first WHILE
ensures that thread iam will not start with its slice of 
the J loop before the previous thread has updated its 
data. The second WHILE is used to signal data avail- 
ability to the next thread. 

The performance of the pipelined parallel imple- 
mentation of the LU benchmark is discussed in [12]. 
The timings show that the directive based implementa- 
tion does not scale as well as a message passing 
implementation of the same algorithm. The cost of 
pipelining results mainly from waiting during startup and
finishing. The message-passing version employs a
two-dimensional pipeline where the wait cost can be greatly
reduced. The use of nested OpenMP directives offers
the potential to achieve similar scalability to the mes- 
sage passing implementation. 
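The startup and finishing cost of a one-dimensional pipeline can be estimated with a simple model. The sketch below is an idealization in Python and not a measurement: it assumes that every (thread, K plane) block takes one time unit, that thread t may start plane k only after thread t-1 has finished the same plane and after it has finished its own plane k-1, and that synchronization itself is free. The name pipeline_steps is hypothetical.

```python
def pipeline_steps(n_threads, n_planes):
    """Time steps needed by an idealized 1D pipeline with unit-cost blocks."""
    finish = [[0] * n_planes for _ in range(n_threads)]
    for t in range(n_threads):
        for k in range(n_planes):
            left = finish[t - 1][k] if t > 0 else 0   # neighbor's data ready
            prev = finish[t][k - 1] if k > 0 else 0   # own previous plane done
            finish[t][k] = max(left, prev) + 1
    return finish[-1][-1]


steps = pipeline_steps(4, 10)
efficiency = 10 / steps      # ideal time (n_planes) over pipelined time
print(steps, round(efficiency, 3))
```

In this model the total is n_planes + n_threads - 1 steps, so the fraction of time lost to filling and draining the pipeline grows with the number of threads for a fixed number of planes.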

There is, however, a problem in setting up a direc- 
tive-based two-dimensional pipeline. The structure of 
the Loop_Body depicted in Figure 5 looks like:

      DO I = ILOW, IHIGH
        DO M = 1, 5
          TV(M,I,J) = V(M,I,J,K-1)
     >              + V(M,I,J-1,K)
     >              + V(M,I-1,J,K)
        END DO
        DO M = 1, 5
          V(M,I,J,K) = TV(M,I,J)
        END DO
      END DO

If both J- and I-loop are to be parallelized employing
pipelines, a thread would need to be able to synchro- 
nize with its neighbor in the J- and I-directions on 
different nesting levels. Parallelizing the I-loop with 
OpenMP directives introduces an inner parallel region, 
as shown below (see also the discussion in Section 3.3) 

!$OMP PARALLEL
      synchronization1
!$OMP DO
      DO JT = ...
!$OMP PARALLEL   <-
        DO J = JLOW, JHIGH
          synchronization2
!$OMP DO
          DO I = ILOW, IHIGH
          END DO
!$OMP END DO NOWAIT
          synchronization2
        END DO
!$OMP END PARALLEL
      END DO
!$OMP END DO NOWAIT
      synchronization1

The end of the inner parallel region forces the threads
to join and destroys the multilevel pipeline mechanism. 
In order to set up a 2-dimensional pipeline we would 
need to have the possibility of nested OMP DO direc- 
tives within the same parallel region. The 
NanosCompiler team is currently implementing 
OpenMP extensions to address this problem. A brief 
overview on this work is given in Section 6. 


4.3 Unsuitable loop structure in ARC3D 

ARC3D uses an implicit scheme to solve Euler and 
Navier-Stokes equations in a three-dimensional (3D) 
rectilinear grid. The main component is an ADI solver, 
which results from the approximate factorization of 
finite difference equations. The actual implementation 
of the ADI solver (subroutine STEPF3D) in the serial 
ARC3D is illustrated in Figure 6. It is very similar to 
the SP benchmark. 



Figure 6: The schematic flowchart of the ADI 
solver in ARC3D. 


For each time step, the solver first sets up boundary 
conditions (BC), forms the explicit right-hand-side 
(RHS) with artificial dissipation terms (FILTER3D), 
and then sweeps through three directions (X, Y and Z) 
to update the 5-element fields, separately. Each sweep 
consists of forming and solving a series of scalar pen- 
tadiagonal systems in a two-dimensional plane one at a 
time. Two-dimensional arrays are created from the 3D 
fields and are passed into the pentadiagonal solvers 
(VPENTA3 for the first 3 elements and VPENTA for
the 4th and 5th elements, both originally written for vector
machines), which perform Gaussian eliminations.
The solutions are then copied back to the three- 
dimensional residual fields. Between sweeps there are 
routines (TKINV, NPINV and TK) to calculate and
solve small, local 5x5 eigensystems. Finally the solu- 
tion is updated for the current time step. 

We ran ARC3D for two different problem sizes. In 
both cases the performance dropped by 10% to 70% 
when the number of groups was smaller than the num- 
ber of threads, i.e. when multilevel parallelism was 
used. Example timings for both problem sizes and 64 
threads are given in Figure 7. The timings for outer 
level parallelism are given in Figure 8. 

Even though the time consuming solver in ARC3D 
is similar to the one in the SP benchmark, our approach 
to automatic multilevel parallelization was not success- 
ful. For ARC3D CAPO identified 58 parallel loops, 35 
of which were suitable for nested parallelization. 19 of 
the 35 nested parallel loops had very little work in the 
inner parallel loop and inefficient memory access. An 
example is shown below. 

!$OMP PARALLEL DO GROUPS(ngroups)
!$OMP& PRIVATE(AR, BR, CR, DR, ER)
      DO K = KLOW, KUP
!$OMP PARALLEL DO
        DO L = 2, LM
          DO J = 2, JM
            AR(L,J) = AR(L,J) + V(J,K,L)
            BR(L,J) = BR(L,J) + V(J,K,L)
            CR(L,J) = CR(L,J) + V(J,K,L)
            DR(L,J) = DR(L,J) + V(J,K,L)
            ER(L,J) = ER(L,J) + V(J,K,L)
            CR(L,J) = CR(L,J) + 1.
          END DO
        END DO
      END DO

Parallelizing the L loop increases the execution time of 
the loop considerably due to a high number of cache 
invalidations. The occurrence of many such loops in 
the original ARC3D code nullifies the benefits of a 
better load balance and we see no speed-up for multi- 
level parallelism. 


[Plot: ARC3D nested parallelism timings for problem sizes 64x64x64 and 194x194x194]


Figure 7: Timings of ARC3D with varying num- 
ber of thread groups for a given total of 64 threads. 


[Plots: ARC3D timings for problem sizes 194x194x194 and 64x64x64; x-axis: number of threads]


Figure 8: Timings from the outer level paralleli- 
zation of ARC3D. 

The example of ARC3D shows that parallelizing all
loops in an application indiscriminately on two levels
with the same number of groups and the same
weight for each group may actually increase the execution
time. At the least we will need to extend the
CAPO directives browser to allow the user to inspect
all multilevel parallel loops and possibly perform
code transformations or disable nested directives.

5 Related work 

There are a number of commercial and research 
parallelizing compilers and tools that have been devel- 
oped over the years. Some of the more notable ones 
include Superb [24], Polaris [6], Suif [23], KAI’s toolkit
[15], VAST/Parallel [20], and FORGexplorer [1].

Regarding OpenMP directives, most current com- 
mercial and research compilers mainly support the 
exploitation of a single level of parallelism and special 
cases of nested parallelism (e.g. double perfectly 
nested loops as in the SGI MIPSpro compiler). The 
KAI/Intel compiler offers, through a set of extensions 
to OpenMP, work queues and an interface for inserting 
application tasks before execution (WorkQueue proposal 
[22]). At the research level, the Illinois-Intel 
Multithreading library [7] provides a similar approach 
based on work queues. In both cases, there is no explicit 
control (at the user or compiler level) over the 
allocation of threads, so they do not support the logical 
clustering of threads in the multilevel structure, which 
we think is necessary to allow good work distribution 
and data locality exploitation. 

Compaq recently announced support for nested 
parallel regions in its Fortran compiler for Tru64 systems 
[3]. The Omni compiler [18], which is part of the 
Real World Computing Project, also supports nested 
parallelism through OpenMP directives. 

There are a number of papers reporting experiences 
in combining multiple programming paradigms (such 
as MPI and OpenMP) to exploit multiple levels of par- 
allelism. However, there is not much experience in the 
parallelization of applications with multiple levels of 
parallelism simply using OpenMP. Implementation of 
nested parallelism by means of controlling the alloca- 
tion of processors to tasks in a single-level parallelism 
environment is discussed in [5]. The authors show the 
improvement due to nested parallelization. 
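The general idea of emulating two levels in a single-level environment can be sketched as follows: each thread derives a group index and a rank within the group from its flat thread id, then takes the matching block of the outer and inner iteration spaces. This is a minimal Python illustration under assumed names and a contiguous block distribution; it is not the scheme of [5], whose details are not reproduced here.

```python
def block(lo, hi, parts, idx):
    """Return the idx-th of `parts` contiguous blocks of the inclusive range lo..hi."""
    n = hi - lo + 1
    base, extra = divmod(n, parts)
    start = lo + idx * base + min(idx, extra)
    size = base + (1 if idx < extra else 0)
    return range(start, start + size)


def two_level_work(tid, ngroups, threads_per_group, k_bounds, l_bounds):
    """Map a flat thread id to the outer (K) and inner (L) iterations it owns."""
    group, rank = divmod(tid, threads_per_group)
    if not 0 <= group < ngroups:
        raise ValueError("thread id outside ngroups * threads_per_group")
    ks = block(k_bounds[0], k_bounds[1], ngroups, group)          # outer level: one block per group
    ls = block(l_bounds[0], l_bounds[1], threads_per_group, rank)  # inner level: one block per rank
    return ks, ls
```

For example, with 4 groups of 2 threads, thread 5 belongs to group 2 with rank 1 and would own K = 5..6 and L = 6..9 of the ranges K = 1..8, L = 2..9.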

Other experiences using nested OpenMP directives 
with the NanosCompiler are reported in [2]. In the examples 
discussed there, the directives have not been 
automatically generated. 

6 Project Status and Future Plans 

We have extended the CAPO automatic paralleliza- 
tion support tool to automatically generate nested 
OpenMP directives. We used the NanosCompiler to 
evaluate the efficiency of our approach. We conducted 
several case studies, which showed that: 

• Nested parallelization was useful to improve load 
balancing. 

• Nested parallelization can be counterproductive 
when applied without considering workload distribution 
and memory access within the loops. 

• Extensions to the OpenMP standard are needed to 
implement nested parallel pipelines. 

We are planning to enhance the CAPO directives 
browser to allow the user to view loops that are 
candidates for nested parallelization. Nested parallelization 
may then be turned on selectively and necessary 
loop transformations can be performed. We are also 
considering the automatic determination of an appro- 
priate number of groups and the assignment of 
different weights to the groups. Currently CAPO is 
also being extended to support hybrid parallelism 
which combines coarse-grained parallelization based 
on message passing and fine-grained parallelization 
based on directives. 
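The assignment of different weights to groups could, for instance, translate weights into thread counts as in the following Python sketch. It uses a largest-remainder rule and gives every group at least one thread; the function name, the rule, and the restriction to positive integer weights are assumptions for illustration and do not describe the allocation policy of CAPO or the NanosCompiler.

```python
def threads_per_group(total_threads, weights):
    """Distribute total_threads over groups roughly in proportion to integer weights."""
    ngroups = len(weights)
    if ngroups == 0 or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive integers")
    if total_threads < ngroups:
        raise ValueError("need at least one thread per group")

    spare = total_threads - ngroups          # threads left after one per group
    wsum = sum(weights)
    # Integer quotient and remainder of each group's proportional share.
    parts = [divmod(spare * w, wsum) for w in weights]
    counts = [1 + q for q, _ in parts]

    # Hand out the threads lost to rounding, largest remainder first.
    leftover = total_threads - sum(counts)
    by_remainder = sorted(range(ngroups), key=lambda g: parts[g][1], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts
```

With equal weights this reduces to the uniform case used in the experiments: 64 threads and four groups of weight 1 yield 16 threads per group.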

OpenMP extensions are currently being imple- 
mented in the framework of the NanosCompiler to 
easily specify precedence relations causing pipelined 
executions. These extensions are also valid in the scope 
of nested parallelism. They are based on two compo- 
nents: 

• The ability to name work-sharing constructs (and 
therefore reference any piece of work coming out 
of it). 

• The ability to specify predecessor and successor 
relationships between named work-sharing con- 
structs (PREC and SUCC clauses). 

This avoids the manual transformation of the loop 
to access data slices and the manual insertion of synchronization 
calls. From the new directives and clauses, the 
compiler automatically builds synchronization data 
structures and inserts synchronization actions following 
the predecessor and successor relationships defined [8]. 
These relationships can cross the boundaries of parallel 
loops and therefore avoid the problems that CAPO 
currently has in implementing two-dimensional pipelines. 
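To make concrete what kind of hand-written synchronization such clauses would replace, the following Python sketch pipelines two loops over blocks of data: the successor loop may process block i only after the predecessor loop has finished it. Python threads and `threading.Event` stand in for OpenMP threads and runtime synchronization; the function name and the per-block computations are hypothetical, and this is not the code generated by the NanosCompiler.

```python
import threading


def run_pipeline(nblocks):
    """Run a two-stage pipeline with one 'block finished' flag per data block."""
    data = [0] * nblocks
    result = [None] * nblocks
    done = [threading.Event() for _ in range(nblocks)]

    def predecessor():
        # Plays the role of the first (named) work-sharing construct.
        for i in range(nblocks):
            data[i] = i * i
            done[i].set()          # signal: block i is ready

    def successor():
        # Plays the role of the construct that names the first as predecessor.
        for i in range(nblocks):
            done[i].wait()         # block until the predecessor finished block i
            result[i] = data[i] + 1

    threads = [threading.Thread(target=successor),
               threading.Thread(target=predecessor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

The flags and the explicit set and wait calls are exactly the bookkeeping that a user would otherwise have to write and maintain by hand for every pipelined pair of loops.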

We plan to conduct further case studies to compare 
the performance of parallelization based on nested 
OpenMP directives with hybrid and pure message 
passing parallelism. 